ABSTRACT


The purpose of this study was to estimate the 
reliability of gymnastic ratings and the reliability of 
gymnastic raters. It was also part of this study to 
compare four methods of assessing the performance score of 
the athletes. 

In order to estimate the reliability of the ratings 
the analysis of variance was used. The coefficients of 
reliability of the ratings were closely related with the 
range of ability of the athletes. Higher coefficients were 
obtained for a heterogeneous group of athletes whereas lower 
coefficients were observed for a more homogeneous group. 

The reliability of each judge was determined by the 
principal components method of factoring. A uni-factor model 
was proposed and it was suggested that the largest eigenvalue 
extracted from the factorial analysis of the ratings of 
each event be used to estimate the quality of the ratings 
and the raters. 

Finally, the comparison by rank correlation of four
methods of assessing the performance score did not suggest
to any large extent the superiority of one method over another.

CHAPTER I
THE PROBLEM 


Introduction 

In many areas of human performance such as 
gymnastics, diving, skating and skiing, it is necessary 
to rely on the judgement of raters to assess the quality 
of a performance. There has probably never been a 
gymnastic competition where the athletes, the coaches, the 
organizers or even the spectators have not expressed bitter 
feelings about the more or less subjective ratings of the 
judges. As was pointed out by Festa (1963): Are the judges 
capable? or are not the people who criticize the judges 
prejudiced in favour of their own work? 

The problem of having competent and objective 
judges at all levels of gymnastic competition has been 
difficult to solve. To this effect, Roetzhein and Muzyczko 
(1968) have indicated the necessity of instituting an 
objective and meaningful national ranking system for judges. 
At the present time, the test suggested by the FIG 
(Fédération Internationale de Gymnastique) does not seem 
sufficient to identify competence and objectivity in judging. 
Landers (1970) stated: 


It is therefore surprising that a sport which relies 
so heavily on human observers to determine the 
outcome of competition does not have more research 
knowledge concerning factors which may influence the 
accuracy of judges' scores. 

FIG Code of Points

In order to improve on the objectivity of the 
judges in gymnastic competition, a set of rules was 
established in 1949. According to these rules (1968), 
four judges are used to determine the score of each 
performer. These four judges are supervised by a superior 
judge who also gives his rating of the performance. The 
net score of each gymnast is determined by taking the 
average of the middle two scores of the four judges 
mentioned above. This procedure is further governed by
the following articles.

Article 11.1. All exercises are scored with points
ranging from 0 to 10 with deductions of whole points, half
points and tenths of points.


Article 11.2. The points difference between the two 
middle scores may not be greater than: 

0.10 with an average of 9.60 or higher 

0.20 with an average of 9.00 to 9.55 

0.30 with an average of 8.00 to 8.95 

0.50 with an average of 6.50 to 7.95 

0.80 with an average of 4.00 to 6.45 

1.00 in all other cases

When the difference between the two middle scores
does not fall within the above range, the superior judge
is consulted.

Article 9.0. His mark (superior judge) added to the
average of the two middle marks of the four judges,
divided by two, is the valid basic score. It is used for
possible intervention in consultations when needed.
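
The scoring rules in Articles 11.2 and 9.0 can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the tolerance of 1.00 for the lowest band is an assumption, and the small `eps` guard against floating-point rounding is an implementation detail that is not part of the FIG rules.

```python
def fig_net_score(ratings):
    """Average of the two middle ratings from a four-judge panel."""
    if len(ratings) != 4:
        raise ValueError("expected exactly four ratings")
    low_middle, high_middle = sorted(ratings)[1:3]
    return (low_middle + high_middle) / 2


def allowed_difference(average):
    """Largest permitted gap between the two middle scores (Article 11.2)."""
    if average >= 9.60:
        return 0.10
    if average >= 9.00:
        return 0.20
    if average >= 8.00:
        return 0.30
    if average >= 6.50:
        return 0.50
    if average >= 4.00:
        return 0.80
    return 1.00  # assumed value for "all other cases"


def needs_superior_judge(ratings, eps=1e-9):
    """True when the two middle scores differ by more than the allowed gap."""
    low_middle, high_middle = sorted(ratings)[1:3]
    average = (low_middle + high_middle) / 2
    return (high_middle - low_middle) > allowed_difference(average) + eps


def basic_score(ratings, superior_mark):
    """Article 9.0: mean of the superior judge's mark and the net score."""
    return (superior_mark + fig_net_score(ratings)) / 2
```

For a panel rating of 9.0, 9.2, 9.4 and 9.5 the two middle marks are 9.2 and 9.4, so the net score is 9.3 and the gap of 0.2 stays within the tolerance for that band.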


Article 17.0. The evaluation of optional exercises 
takes place on the basis of three evaluation factors: 

(a) Difficulty

(b) Combination (formation of the exercise)

(c) Execution

The points are distributed in the following manner:

(a) Difficulty        3.40 points
(b) Combination       2.60 points
(a + b)               6.00 points (the actual value of an exercise)
(c) Execution         4.00 points (for correct form and technically correct execution)
(a + b + c)          10.00 points

For the evaluation of the final competition, the
jury in each event, as described by the FIG procedure, must
be composed of:

Article 47.0. (b) Two superior judges and four judges
of which one head judge and four judges must come from
nations not participating in this event (neutral judges).

(c) If there should be a discussion
and no common understanding be found between the two superior
judges the score to be given by the superior judges will be
the average of their individual scores.


Statement of the Problem 

A gymnastic competition involves performance in six
different events: floor exercises, horizontal bar, parallel 
bars, pommel horse, vault and rings. The foremost purpose 
of this study was to assess the reliability of gymnastic 
judges and the reliability of gymnastic ratings as obtained 
at four levels of gymnastic competition. 

In addition, four methods to assess the performance 
score were compared: 

1. The FIG method where the performance score 
is the average of the middle two ratings given by the four 
judges. 

2. The unweighted composite method where the 
performance score is the average of the ratings of all the 
judges including the superior judge's rating. 

3. The two most reliable judges’ method where the 
performance score is derived from averaging the ratings of 
the two most reliable judges as obtained by the principal 
components method of factoring. 

4. The weighted composite method where the
performance score is a weighted and rescaled score derived
by the principal components method of factoring.
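
The first two methods, and the kind of rank comparison used to set them side by side, can be sketched as follows. The ratings are invented for illustration, the function names are hypothetical, and the rank correlation shown is the usual Spearman formula for untied ranks, which is an assumption about the form of rank correlation intended.

```python
def fig_score(panel):
    """Method 1: mean of the two middle ratings of the four judges."""
    low_middle, high_middle = sorted(panel)[1:3]
    return (low_middle + high_middle) / 2


def unweighted_composite(panel, superior):
    """Method 2: mean of all ratings, superior judge included."""
    return (sum(panel) + superior) / (len(panel) + 1)


def ranks(scores):
    """Rank 1 goes to the highest score (ties are not handled)."""
    order = sorted(scores, key=scores.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(order)}


def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for untied ranks."""
    n = len(rank_a)
    d2 = sum((rank_a[name] - rank_b[name]) ** 2 for name in rank_a)
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical ratings: (four panel judges, superior judge) per gymnast.
meet = {
    "A": ([9.1, 9.3, 9.4, 9.0], 9.2),
    "B": ([8.7, 8.8, 9.0, 8.6], 8.9),
    "C": ([9.5, 9.4, 9.6, 9.3], 9.5),
}
fig = {g: fig_score(panel) for g, (panel, _) in meet.items()}
composite = {g: unweighted_composite(panel, sup) for g, (panel, sup) in meet.items()}
rho = spearman_rho(ranks(fig), ranks(composite))
```

With these invented ratings both methods order the gymnasts C, A, B, so the rank correlation is 1.0.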

Samples


For this study, the ratings of four gymnastic 
competitions were analysed: 

1. The National Gymnastic Meet held in Toronto in 
1973. For this meet there were three levels of competition: 
optionals, compulsories and finals. The results for the 
junior men and senior men were analysed. 

2. A dual meet between Canada and China held in 
Montreal in 1973. 

3. The national trials held in Winnipeg in 1973. 

4. A western intercollegiate meet held in Edmonton
in 1971.

Altogether the ratings from fifty-four events
were used for the study. The selection of the four meets
was based solely on the availability of the results, and
for this reason only the men's competitions were analysed.


Significance of the Study


As previously indicated, very little research has 
been conducted to determine and evaluate the quality of the 
judging associated with gymnastic competitions. 

Results from this study and subsequent similar studies, 
if done, could help the Canadian Gymnastic Federation in the 
selection of methods to establish standards related to the 
quality and objectivity of gymnastic ratings. It is
understandable that the level of agreement between judges in
assessing gymnastic performance is higher in competitions
where the athletes fall within a wide range of abilities 
and lower in situations where the range of abilities is 
narrower. Since the reliability of the ratings gives an 
estimate of this level of agreement between judges, the 
results from this study could be used as a starting point 
for the creation of standards of agreement between judges 
associated with gymnastic ratings. 

Secondly, the assessment of individual judge 
reliability would certainly assist in making a better selection 
of judges based on an objective evaluation of their previous 
ratings. 

Finally, a new ranking system of athletes obtained
from a weighted composite score may be worthy of consideration.

CHAPTER II
REVIEW OF LITERATURE 


As was noted in the introductory chapter, the 
FIG adopted a new system of rules in 1949 where the 
judging panel was composed of one superior judge and four 
other judges. The performance score was determined by
taking the average of the middle two scores of the four
judges. The superior judge would have the power to use
his score if the difference between the two middle scores
did not fall within the range set by the FIG (1968, Article
11.2).

RESEARCH IN GYMNASTIC JUDGING


So far, little research has been done to evaluate
the objectivity and the reliability of the judges selected
to assess performance in gymnastic competition. However,
four different commonly used approaches are revealed in the
literature.


Gross Scores Versus Net Scores 

In this first approach, the purpose was to compare
the standing of each competitor obtained by averaging all
the scores given by the judging panel (gross scores) with the
standing obtained by the conventional method of averaging
the middle scores (net scores).

Results from the 1950 National Collegiate Athletic
Association Gymnastic Meet were analysed by Hunsicker and 
Loken (1951). The gross scores were compared to the net 
scores and it was observed that differences in rank resulted. 

Similar results were also observed at the National 
Collegiate Athletic Association Gymnastic Meet held at the 
University of Illinois, in April 1961. In this study, 
Faulkner and Loken (1962) used the average of the four 
judges (excluding the score of the superior judge) and 
compared these results with the average of the middle two 
scores. 

So far this method does not give any evidence of the
superiority, in terms of objectivity and reliability, of
one method (gross scores) over the other (net scores) to
assess the performance score.


Range of Scores Versus Quality of Judging

A second approach was proposed by Calkin (1968, 1969)
based on the assumption that the greater the range between
scores given by all judges over the same performance the
poorer the quality of judging would be. In order to take
a "more objective look at a judge's performance", Calkin's
first study analysed: (1) the number of times each judge's
score was discarded because it was too high; (2) the
number of times a judge's score was discarded because it was
too low; (3) the number of times a judge's score was more
than the FIG range above the meet score; (4) the number of
times a judge's score was more than the FIG range below
the meet score; (5) the mean of all the scores given by each
judge for each event.

In a follow-up study, the same author (1969) proposed 
to evaluate the work of individual officials and compared 
their ratings over a season. From these analyses it was 
concluded that events that were rated having a small rather 
than large difference between judges' scores were "objectively" 
better judged. 

This method has helped in evaluating the performance
of a judge but so far it has not been useful in the selection
of judges in terms of their objectivity and reliability.


Intercorrelations 

A third approach, and one that has been used more 
frequently, was the utilization of the product moment 
correlation to determine the degree of agreement between the 
ratings of a judging panel. 

Using inter-judge correlations, very low coefficients 
have been observed by Faulkner and Loken (1962) in the 
parallel bars, tumbling and free exercises events (0.41, 
Oa46y:+sOell, 0.34, 0.27). 

However, in another study when the intercorrelations
among the five judges for the six events were determined by
Hunsicker and Loken (1951), only one fell below 0.80. This
would indicate a high level of agreement between the judges.

The purpose of Calkin's studies (1968, 1969) was to
analyse the ratings of gymnastic judges. The
intercorrelations among judges were given but there was no
indication of whether the data used were real.


Bauer Method Versus FIG Method 

A fourth method has been suggested whereby gymnastic 
routines would be evaluated on film. It was felt that this 
method would control external factors such as the effect of 
an audience, the presence of other athletes, the judges 
changing scores after a meet, the presence of other judges, 
etc. It would thus be possible to use more adequate statistical 
techniques to determine the reliability of each judge. 

This last approach was used by Landers (1969). Two
judging systems were compared: (1) the FIG system where each
judge assigned numerical ratings for the following categories:
difficulty, composition and execution. The performance score
is the sum of the ratings on the three categories. (2) the
Bauer system where the judges rate only one of the three
categories. The performance score is determined by summing
one judge's rating of difficulty, one judge's rating of
composition and the average of two judges' ratings of
execution.

Twelve qualified Bauer Variation judges who were
used at the 1965 Big Ten gymnastic championship were selected
as subjects to evaluate performance under the Bauer system.
For the FIG system twelve judges representing the Northern
California Gymnastic Officials Association were selected.

All subjects viewed an 8 mm. film containing twenty- 
three routines. The first routine was also inserted at the 
end of the film and was used to determine each judge's 
reliability. 

In order to compare both systems, an absolute rating
of each routine was determined by the investigator and by
a recognized authority in each of the respective observation
systems.

The results indicated a higher reliability for the 
judges under the Bauer system (0.853) as compared with the 
FIG judges (0.619). The author pointed out that these results 
clearly demonstrated the greater effectiveness of judges 
rating one rather than several observational categories. 

But, since these results have never been duplicated
and since great doubt should be expressed concerning the
internal and external validity of the experiment in terms
of the selection of the judges, non-identical tasks for the
two groups of judges and difficulty in generalizing to one
system or the other, it would be hazardous to accept without
question the conclusions of this study.

ESTIMATES OF RELIABILITY 


When dealing with psychological variables such as
gymnastic ratings one is concerned with how reliable a judge
is in his ratings. Would an athlete, under the same
conditions, obtain from a judge a similar score for an
identical performance on different occasions?

Since it is well known that ratings by any judge 
may contain a certain component of error, one common 
practice has been to have several judges rate the same 
performance and average the ratings in order to determine 
the "true performance score" of the athlete. With some 
modifications this method has been used in situations such 
as skating, diving, skiing and gymnastics. 

Reliability problems are thus related to the
accuracy of the judges. In the area of mental tests several
methods based on the Classical Test Theory Model (Lord and
Novick, 1968; Magnusson, 1966; Gulliksen, 1950) have been
developed to assess the reliability of a measurement.


Background: The Classical Model 


The basic equation of the Classical Model (Gulliksen,
1950) defines the observed score as being made up of a true
component and an error component.

\[ X = T + E \tag{1} \]

By defining error as the difference between the observed
score and the true score, this model considers only the
random errors, called errors of measurement, and assumes that:

1. The expected value of these errors over a large
number of parallel measurements is zero, that is:

\[ E(E) = 0 \tag{2} \]

2. The correlation between true scores and error
scores is zero.

\[ r_{te} = 0 \tag{3} \]

3. The correlation between the error scores on
one test and the error scores on any other test is zero.

\[ r_{e_1 e_2} = 0 \tag{4} \]

4. The correlation between the true scores on one
test and the error scores on another test is zero.

\[ r_{t_1 e_2} = 0 \tag{5} \]

Using the assumption that E(E) = 0, the expected
true score is equal to the expected value of the observed
scores,

\[ E(T) = E(X) \tag{6} \]

and the variance of the observed scores is the sum of the
true score variance and the error score variance, that is:

\[ \sigma_x^2 = \sigma_t^2 + \sigma_e^2 \tag{7} \]



Definition of parallel tests. Two tests are said
to be parallel when the expected values of the observed scores
are the same in both tests, the variances are the same and
the errors of measurement are the same. These can be
written as:

\[ E(X_1) = E(X_2) \tag{8} \]

\[ \sigma_{x_1}^2 = \sigma_{x_2}^2 \tag{9} \]

\[ \sigma_{e_1} = \sigma_{e_2} \tag{10} \]

It is also part of parallel measurements that all
intercorrelations between the tests are equal (Lord and Novick,
1968, p. 48).


Reliability of parallel tests. Reliability can be
defined as the correlation between parallel tests. Starting
with the equation of the correlation coefficient where
defined, then the correlation between parallel tests is

\[ r_{xx} = \frac{\sigma_t^2}{\sigma_x^2} \tag{15} \]


So the reliability of parallel tests is the ratio of true
score variance to observed score variance. Similarly the
correlation of true and observed scores is expressed as

\[ r_{xt} = \frac{\sum xt}{N \sigma_x \sigma_t} \tag{16} \]

Since r_{te} = 0, then

\[ r_{xt} = \frac{\sigma_t^2}{\sigma_x \sigma_t} \tag{17} \]

By dividing both the numerator and denominator by \sigma_t, we
write

\[ r_{xt} = \frac{\sigma_t}{\sigma_x} \tag{18} \]

So

\[ r_{xt}^2 = \frac{\sigma_t^2}{\sigma_x^2} = r_{xx} \tag{19} \]

\[ r_{xt} = \sqrt{r_{xx}} \tag{20} \]

The correlation of true and observed scores as expressed by
this last equation has been known as the index of reliability
(Gulliksen, 1950, p. 23).

The Spearman-Brown formula. Previously we have
defined reliability as the correlation between parallel
tests, that is:

\[ r_{xx} = \frac{\sigma_t^2}{\sigma_x^2} \tag{21} \]

When a test is increased in length by adding K parallel
tests, the variance of the composite test composed of the
sum of all the tests from 1 to K, is given by

\[ \sigma_{x_{tot}}^2 = \sum_{g=1}^{K} \sigma_{x_g}^2 + \sum_{g=1}^{K} \sum_{\substack{h=1 \\ g \neq h}}^{K} r_{x_g x_h} \sigma_{x_g} \sigma_{x_h} \tag{22} \]

Since the variances of each test are equal and since we have
K(K-1) covariance terms, we can write

\[ \sigma_{x_{tot}}^2 = K \sigma_x^2 + K(K-1) \, r_{xx} \, \sigma_x^2 \tag{23} \]

which reduces to

\[ \sigma_{x_{tot}}^2 = K \sigma_x^2 \left[ 1 + (K-1) \, r_{xx} \right] \tag{24} \]

Similarly, the true variance for the composite test is given
by

\[ \sigma_{t_{tot}}^2 = \sum_{g=1}^{K} \sigma_{t_g}^2 + \sum_{g=1}^{K} \sum_{\substack{h=1 \\ g \neq h}}^{K} r_{t_g t_h} \sigma_{t_g} \sigma_{t_h} \tag{25} \]

From the assumptions of equal true variances and a correlation
of 1.00 between the true scores, we obtain

\[ \sigma_{t_{tot}}^2 = K \sigma_t^2 + K(K-1) \sigma_t^2 \tag{26} \]

which reduces to

\[ \sigma_{t_{tot}}^2 = K^2 \sigma_t^2 \tag{27} \]

Thus the reliability of a test increased in length K times
is expressed by

\[ r_{KK} = \frac{K^2 \sigma_t^2}{K \sigma_x^2 \left[ 1 + (K-1) \, r_{xx} \right]} \tag{28} \]

and since

\[ r_{xx} = \frac{\sigma_t^2}{\sigma_x^2} \tag{29} \]

then

\[ r_{KK} = \frac{K \, r_{xx}}{1 + (K-1) \, r_{xx}} \tag{30} \]

which is the well known Spearman-Brown formula for a test
increased in length by K parallel tests.
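
As a quick numerical check of the Spearman-Brown formula, the following minimal sketch evaluates it for a given single-test reliability and lengthening factor K. The function name is hypothetical.

```python
def spearman_brown(r_xx, k):
    """Reliability of a test lengthened to k parallel parts.

    r_xx is the reliability of one part; k is the number of parallel parts.
    """
    return k * r_xx / (1 + (k - 1) * r_xx)


# Doubling a test whose reliability is 0.50 raises it to about 0.67.
doubled = spearman_brown(0.50, 2)
```

The gain shrinks as the starting reliability rises: with K = 2, a reliability of 0.50 becomes roughly 0.67, whereas 0.90 becomes only about 0.95.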


Reliability estimated by split halves methods. A
very common procedure to estimate the reliability of a test
made up of many items has been to divide the test into two
comparable halves and compute the correlation between the two
halves. The coefficient obtained in this manner can be
regarded as the reliability coefficient for one of the test
halves. In order to obtain the reliability of the whole test,
the Spearman-Brown formula is used to correct for halving the
test length.

A similar procedure was established by Rulon (1939).
The test was divided into two halves but the assumption of
equal observed variance was not necessary. His reliability
coefficient was expressed by

\[ r_{xx} = 1 - \frac{\sigma_d^2}{\sigma_{tot}^2} \tag{31} \]

It was shown that the variance of the differences in the
observed scores between the two halves was equal to the
error variance, and the reliability was given by

\[ r_{xx} = 1 - \frac{\sigma_e^2}{\sigma_{tot}^2} \tag{32} \]

which is equivalent to equation 15.

A simpler equation which achieved the same result
as Rulon's method was derived by Guttman (1945) and was
expressed as

\[ r_{xx} = 2 \left( 1 - \frac{\sigma_{x_1}^2 + \sigma_{x_2}^2}{\sigma_{tot}^2} \right) \tag{33} \]


When all items of a test are considered to be parallel 
to each other, there are many ways to divide the whole test 
into two halves. It was this problem of getting an average 
reliability coefficient for the whole test that led Kuder and 
Richardson (1937) to derive the following. 

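The total variance of a test composed of K items
is given by

\[ \sigma_{tot}^2 = \sum_{i=1}^{K} \sigma_i^2 + \sum_{i=1}^{K} \sum_{\substack{j=1 \\ i \neq j}}^{K} r_{ij} \sigma_i \sigma_j \tag{34} \]

Since \sigma_i = \sigma_j we can write

\[ \sigma_{tot}^2 = \sum \sigma_i^2 + K(K-1) \, \bar{r}_{ij} \, \bar{\sigma}_i^2 \tag{35} \]

where \sigma_{tot}^2 is the total test variance,
\sigma_i^2 is the variance for one item,
K is the number of items,
\bar{r}_{ij} is the average correlation between the K items,
\bar{\sigma}_i^2 is the average variance for the K items.

But

\[ \bar{\sigma}_i^2 = \frac{\sum \sigma_i^2}{K} \tag{36} \]

So

\[ \bar{r}_{ij} = \frac{\sigma_{tot}^2 - \sum \sigma_i^2}{(K-1) \sum \sigma_i^2} \tag{37} \]
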

This last expression gives us the average correlation among
the items. Because we have assumed equal intercorrelations
among the items, \bar{r}_{ij} can also be interpreted as the
reliability coefficient of a single item. In order to obtain
the reliability of the whole test, the Spearman-Brown formula
is applied and we get

\[ r_{KK} = \frac{K \, \bar{r}_{ij}}{1 + (K-1) \, \bar{r}_{ij}} \tag{38} \]

Replacing the value of \bar{r}_{ij} in equation 38 by equation 37,
we write

\[ r_{KK} = \frac{K \, \dfrac{\sigma_{tot}^2 - \sum \sigma_i^2}{(K-1) \sum \sigma_i^2}}{1 + (K-1) \, \dfrac{\sigma_{tot}^2 - \sum \sigma_i^2}{(K-1) \sum \sigma_i^2}} \tag{39} \]

which reduces to

\[ r_{KK} = \frac{K}{K-1} \cdot \frac{\sigma_{tot}^2 - \sum \sigma_i^2}{\sum \sigma_i^2} \cdot \frac{\sum \sigma_i^2}{\sum \sigma_i^2 + \sigma_{tot}^2 - \sum \sigma_i^2} \tag{40} \]

to

\[ r_{KK} = \frac{K}{K-1} \cdot \frac{\sigma_{tot}^2 - \sum \sigma_i^2}{\sigma_{tot}^2} \tag{41} \]

and finally to

\[ r_{KK} = \frac{K}{K-1} \left( 1 - \frac{\sum \sigma_i^2}{\sigma_{tot}^2} \right) \tag{42} \]

Equation 42 is generally known as the K-R 20. The K-R 20
becomes a special case of Cronbach's coefficient alpha (\alpha)
which is given by

\[ \alpha = \frac{K}{K-1} \left( 1 - \frac{\sum \sigma_i^2}{\sigma_{tot}^2} \right) \tag{43} \]

where \sigma_i^2 is the variance of each item after being weighted
and \sigma_{tot}^2 is the total variance of the test made up of weighted
items. The coefficient \alpha is algebraically equal to "the
average of all possible split half coefficients of a given
test" (Cronbach, 1951, p. 300).
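
Coefficient alpha can be computed directly from a subjects-by-items score table. The sketch below is illustrative only: the function name and the data are hypothetical, and population variances (dividing by N) are used for both the item and the total variances, which is an assumption about the variance convention.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Coefficient alpha for a table with one row per subject, one column per item."""
    k = len(scores[0])
    item_variances = [pvariance([row[i] for row in scores]) for i in range(k)]
    total_variance = pvariance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical dichotomous items (1 = pass, 0 = fail) for four subjects.
items = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
alpha = cronbach_alpha(items)
```

For these invented data the item variances sum to 0.625 and the total-score variance is 1.25, so alpha is 1.5 × (1 − 0.5) = 0.75. With items scored one or zero the same computation is the K-R 20.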

The Reliability of Ratings


The preceding discussion dealt with the reliability
of tests made up of many items. Many split halves methods
were presented and among them Rulon's common procedure from
which the reliability of a test is given by

\[ r_{xx} = 1 - \frac{\sigma_d^2}{\sigma_{tot}^2} \tag{45} \]

From this model, Hoyt (1941) suggested that the variance
of the differences (\sigma_d^2) was a measure of discrepancy
between the observed variance and the true variance.
Depending on how fortunate or unfortunate one is in dividing
the test into two comparable halves, the reliability of the
whole test could then be overestimated or underestimated.
In order to seek a better estimate of this error variance
Hoyt applied the analysis of variance to items scored either
one or zero.

The "between subjects" and the "between items" sums
of squares were subtracted from the total sum of squares
in order to estimate the error variance which was given by

\[ SS_e = SS_{tot} - (SS_s + SS_i) \tag{46} \]

From the definition of reliability
the estimation of the coefficient of reliability based on
the analysis of variance was given by Hoyt as

\[ r_{xx} = \frac{MS_s - MS_e}{MS_s} \tag{47} \]



The author pointed out that the result obtained using the 
analysis of variance was equivalent to the result using the 
Kuder-Richardson formula 20. 
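
The equivalence between the analysis of variance estimate and the K-R 20 can be checked numerically. The sketch below is a minimal illustration on invented data: the function names are hypothetical, and the K-R 20 is computed with item variances p(1 − p) and a total variance that divides by N, which is an assumption about the variance convention.

```python
def hoyt_reliability(scores):
    """ANOVA estimate: (MS subjects - MS error) / MS subjects."""
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_subjects = k * sum((sum(row) / k - grand) ** 2 for row in scores)
    ss_items = n * sum(
        (sum(row[j] for row in scores) / n - grand) ** 2 for j in range(k)
    )
    ss_error = ss_total - (ss_subjects + ss_items)
    ms_subjects = ss_subjects / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_subjects - ms_error) / ms_subjects


def kr20(scores):
    """K-R 20 for items scored one or zero."""
    n, k = len(scores), len(scores[0])
    p = [sum(row[j] for row in scores) / n for j in range(k)]
    totals = [sum(row) for row in scores]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    return k / (k - 1) * (1 - sum(pj * (1 - pj) for pj in p) / var_total)


# Hypothetical pass/fail data: four subjects, three items.
items = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
r_anova = hoyt_reliability(items)
r_kr20 = kr20(items)
```

Both functions return 0.75 for these data, which illustrates the agreement between the two approaches on a complete subjects-by-items table.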

Several authors have pointed out the use of the
analysis of variance techniques in estimating the reliability
of ratings (Horst, 1949; Ebel, 1951; Burt, 1955; Mahmoud,
1955; Engelhart, 1959; Maxwell and Pilliner, 1968; and Winer,
1971).

Essentially the development based on Winer's approach
(1971, p. 273-296) is as follows:

the rating received by a subject "i" from a judge
"j" is expressed as

\[ X_{ij} = \mu + \pi_i + \tau_j + \varepsilon_{ij} \tag{48} \]

where \mu is the grand mean of the ratings,
\pi_i is the true score component associated with
subject "i",
\tau_j is the difference between the mean rating of
judge "j" and the grand mean \mu,
\varepsilon_{ij} is the error of measurement associated with each
subject and each judge.

This structural model assumes that the error component is
normally distributed for each judge and the variance of
errors is equal for all K judges, that is

\[ \sigma_{e_1}^2 = \sigma_{e_2}^2 = \cdots = \sigma_{e_K}^2 = \sigma_e^2 \tag{49} \]

Similarly the true score component \pi_i is also assumed to be
normally distributed and does not vary from one judge to
another over the same subject.

Assuming \pi_i and \varepsilon_{ij} uncorrelated, the variance of
the ratings of judge "1" is given by

\[ \sigma_{x_1}^2 = \sigma_t^2 + \sigma_e^2 \tag{50} \]

where \sigma_t^2 is the variance of the true score component.
Similarly the variance of the ratings of judge "2" is
expressed by

\[ \sigma_{x_2}^2 = \sigma_t^2 + \sigma_e^2 \tag{51} \]

Under the assumptions that the errors are uncorrelated
and that the correlation between error and true scores
is zero, then the covariance becomes

\[ \sigma_{x_1 x_2} = \sigma_t^2 \tag{52} \]

To express the error variance we have

\[ \sigma_{x_1}^2 + \sigma_{x_2}^2 = 2 (\sigma_t^2 + \sigma_e^2) \tag{53} \]

Dividing both sides of the equation by 2 and replacing
\sigma_t^2 by the covariance \sigma_{x_1 x_2}, we obtain

\[ \frac{\sigma_{x_1}^2 + \sigma_{x_2}^2}{2} - \sigma_{x_1 x_2} = \sigma_e^2 \tag{54} \]

which can also be defined as the difference between the
mean variance and the mean covariance of the ratings, that is:

\[ \overline{\mathrm{var}} - \overline{\mathrm{cov}} = \sigma_e^2 \tag{55} \]

Using the reliability formula previously defined
by

\[ r_{xx} = \frac{\sigma_t^2}{\sigma_x^2} \tag{56} \]

and replacing \sigma_t^2 by \sigma_x^2 - \sigma_e^2, we get

\[ r_{xx} = \frac{\sigma_x^2 - \sigma_e^2}{\sigma_x^2} \tag{57} \]

which is equivalent to

\[ r_{xx} = \frac{\sigma_t^2}{\sigma_e^2 + \sigma_t^2} \tag{58} \]



From the analysis of variance model, the error variance
(\sigma_e^2) can be directly estimated by the mean square residual
(MS_{RES}). However, \sigma_t^2 cannot be directly estimated by the
mean square for subjects (MS_S). Rather, MS_S represents K times
the variance of the means for each person, and each mean
consists of a true component and an error component. Therefore,
the true score variance can be estimated by

\[ \hat{\sigma}_t^2 = \frac{MS_S - MS_{RES}}{K} \tag{59} \]

Since the error variance is estimated by the mean square
residual, the reliability of a single rating expressed by
equation 58 becomes

\[ r_1 = \frac{\dfrac{MS_S - MS_{RES}}{K}}{MS_{RES} + \dfrac{MS_S - MS_{RES}}{K}} \tag{60} \]

which reduces to

\[ r_1 = \frac{K (MS_S - MS_{RES})}{K \left[ MS_S - MS_{RES} + K \, MS_{RES} \right]} \tag{61} \]

and to

\[ r_1 = \frac{MS_S - MS_{RES}}{MS_S + (K-1) MS_{RES}} \tag{62} \]



This last equation represents the average reliability of
the ratings which has been called by Ebel (1951) the
intraclass coefficient of reliability.

Applying the Spearman-Brown transformation to
equation 62 in order to obtain the reliability of the average
ratings, we can write

\[ r_K = \frac{K \, \dfrac{MS_S - MS_{RES}}{MS_S + (K-1) MS_{RES}}}{1 + (K-1) \, \dfrac{MS_S - MS_{RES}}{MS_S + (K-1) MS_{RES}}} \tag{63} \]

which reduces to

\[ r_K = \frac{K (MS_S - MS_{RES})}{MS_S + (K-1) MS_{RES} + (K-1)(MS_S - MS_{RES})} \tag{64} \]

and to

\[ r_K = \frac{K (MS_S - MS_{RES})}{MS_S + (K-1) MS_S} \tag{65} \]

and finally to

\[ r_K = \frac{MS_S - MS_{RES}}{MS_S} \tag{66} \]

As Ebel (1951) pointed out, the "between-raters"
variance should be removed

where the final ratings on which decisions are
based consist of averages of complete sets of
ratings from all observers or ratings which have
been equated from rater to rater such as ranks,
Z-scores, etc.


But if one wishes to include the judges' bias in situations
where:

decisions are made in practice by comparing single
"raw scores" assigned to different pupils by different
raters, or by comparing averages which come from
different groups of raters, then the "between-raters"
variance should be included as part of the error
terms.

Following Winer's model, the "within-judges" mean
square (MS_W) is used instead of the mean square residual to
estimate the error variance. In this case the reliability
coefficient of a single rating is given by:

\[ r_1 = \frac{MS_S - MS_W}{MS_S + (K-1) MS_W} \tag{67} \]

and the reliability of the average ratings is expressed by:

\[ r_K = \frac{MS_S - MS_W}{MS_S} \tag{68} \]

Reliability of Raters 


The previous discussion dealt with the estimation
of the reliability of the ratings. It was shown that the
analysis of variance with repeated measures (Winer, 1971,
p. 273-296) would provide such an estimation. Because
ratings of the same performance vary from one judge to another,
due to errors of measurement or consistent bias, it would
be of interest to assess the reliability of each rater.

The factor analysis model and especially the
uni-factor model as presented by Overall (1965) give a solution
to the estimation of rater reliability where it would be
inappropriate to use the test-retest model.

One of the purposes of factor analysis is to explain
the total common variance found between variables (Mulaik,
1972, p. 97). Generally this common variance is referred to
as the communality of a variable and defined as the portion
of the total variance that a variable has in common with
the other variables in a given correlation matrix (Wrigley,
1957).

In the case of gymnastic ratings where non-zero
correlations are observed, it follows that there exists 
some common variance among the judges. It is this common 
variance, based entirely upon true components, that accounts 
for the correlations between a judge and the others. 
Furthermore, perfect correlations would rarely be encountered 
and this lack of perfect correlation suggests the presence 
of unique variance that contributes to the total variance. 
Factor theory assumes that the unique variance is composed 
of two parts: variance that is reliable but specific to a 
test and error variance that arises from test unreliability, 
that is error of measurement (Wrigley, 1957). 

In summary, the total variance of a test is made up
of common variance or communality (h^2), specific variance
(s^2) and error variance (e^2) which is referred to as the
unreliability of the test. The specific variance and the
error variance form the unique variance (u^2).

When the variables are expressed in standard-score
form, that is with a mean of zero and a variance of one,
then

\[ \text{the total variance} = 1 = h^2 + s^2 + e^2 \tag{69} \]

\[ \text{the communality} = h^2 = 1 - u^2 \tag{70} \]

\[ \text{the unique variance} = u^2 = 1 - h^2 \tag{71} \]

\[ \text{the specific variance} = s^2 = u^2 - e^2 \tag{72} \]

\[ \text{the true variance} = h^2 + s^2 \tag{73} \]

\[ \text{the error variance} = e^2 = 1 - r_{xx} \tag{74} \]



From the classical test theory model the reliability
of a variable was previously defined as that part of the
total variance that was true variance. It was given by:

$$r_{xx} = \frac{\sigma_t^2}{\sigma_x^2} \tag{75}$$

Replacing equations 73 and 69 respectively for the true
variance and the total variance the reliability is defined
as:

$$r_{xx} = \frac{h^2 + s^2}{h^2 + s^2 + e^2} \tag{76}$$

Since the total variance expressed in standard-score form
is 1, equation 76 becomes

$$r_{xx} = h^2 + s^2 \tag{77}$$


Factor analysis does not provide for the estimation
of the specific variance. However, because the specific
variance cannot be negative, it can be stated that the
communality of a variable is at least a lower-bound estimate
of the reliability of that variable as expressed by equation
77.
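As an illustration of equations 69 to 77, the following minimal Python sketch splits the unit variance of a standardized variable into its components. The function name and the values chosen for the communality and the specific variance are hypothetical; they only show the arithmetic and are not taken from the data of this study.

```python
def variance_components(h2, s2):
    """Split the unit variance of a standardized variable (equations 69-77).

    h2 is the communality and s2 the specific variance; both are
    hypothetical inputs supplied by the caller.
    """
    e2 = 1.0 - h2 - s2          # error variance, equation 74
    u2 = s2 + e2                # unique variance, equations 71 and 72
    reliability = h2 + s2       # true variance over a total variance of 1
    return {"u2": u2, "e2": e2, "reliability": reliability}


# Hypothetical variable with communality 0.70 and specific variance 0.10.
parts = variance_components(h2=0.70, s2=0.10)
print(parts)
```

Because the specific variance is not negative, the communality passed to the function can never exceed the reliability it returns, which is the lower-bound property used in the text.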


Common factor analysis. One of the purposes of
factor analysis is to extract from the correlation matrix
the minimum number of factors that will account for the common
variance among the variables and will reproduce as closely as
possible the original correlation matrix. To accomplish
this task several methods have been developed such as:
(1) the diagonal method of factoring (Mulaik, 1972), (2)
the centroid method of factoring (Harman, 1967), (3) the 
principal components method (Hotelling, 1933), (4) Image 
analysis (Guttman, 1953) and (5) Alpha factor analysis 
(Kaiser and Caffrey, 1965). 

One way of determining the common variance is to 
extract the factors one at a time. In this manner the 
correlations between each variable and that factor are 
found. The contribution of that factor to the common variance 
is then partialled out from the original correlation matrix. 
From this point we proceed to find the second factor on the
remaining common variance. Each succeeding factor is
obtained in the same manner.
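The one-factor-at-a-time procedure described above can be sketched as follows. This is only an illustration under stated assumptions: unities are kept in the diagonal of the correlation matrix, each factor is found by power iteration (one of several possible numerical methods), and the function names and the three-variable correlation matrix are hypothetical.

```python
import math


def largest_root(R, iters=1000):
    """Power iteration: largest latent root of R and its unit-length vector."""
    n = len(R)
    w = [float(i + 1) for i in range(n)]      # arbitrary non-zero start
    for _ in range(iters):
        v = [sum(R[i][j] * w[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm < 1e-12:                      # nothing left to extract
            return 0.0, [0.0] * n
        w = [x / norm for x in v]
    Rw = [sum(R[i][j] * w[j] for j in range(n)) for i in range(n)]
    return sum(w[i] * Rw[i] for i in range(n)), w


def extract_factors(R, m):
    """Extract m factors one at a time, partialling each out of the matrix."""
    n = len(R)
    residual = [row[:] for row in R]
    loadings = []
    for _ in range(m):
        root, w = largest_root(residual)
        a = [math.sqrt(max(root, 0.0)) * x for x in w]   # loadings on this factor
        loadings.append(a)
        # Remove the contribution of this factor before finding the next one.
        residual = [[residual[i][j] - a[i] * a[j] for j in range(n)]
                    for i in range(n)]
    return loadings, residual


# Hypothetical correlation matrix for three variables.
R_EXAMPLE = [[1.0, 0.8, 0.6],
             [0.8, 1.0, 0.7],
             [0.6, 0.7, 1.0]]

loadings, residual = extract_factors(R_EXAMPLE, 3)
for factor in loadings:
    print([round(a, 3) for a in factor])
```

With as many factors as variables the residual matrix vanishes and the squared loadings of each variable sum to one; with fewer factors the residual shows what the retained factors fail to reproduce.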


The factor model. From the factor model a variable
"j" can be expressed as:

$$z_j = a_{j1}F_1 + a_{j2}F_2 + \cdots + a_{jm}F_m + a_j U_j \tag{78}$$

According to this model the common variance for variable "j"
is accounted for by the factors $F_1$ to $F_m$ whereas the unique
variance is determined by factor $U_j$. It is assumed (Mulaik,
1972, p. 103 and Harman, 1967, p. 17) that the common factors
are uncorrelated with one another, that the common factors
are uncorrelated with the unique factors and furthermore
the unique factors are uncorrelated with one another.
From the model, the common variance for variable "j"
can be expressed by:

$$h_j^2 = a_{j1}^2 + a_{j2}^2 + \cdots + a_{jm}^2 \tag{79}$$

This is also an expression for the communality which is
given as the sum of the squared loadings on the common
factors.


Factor solution. When an investigator is interested
in representing by a single score the performance assessed
by a group of raters, as in gymnastic competition, the
simplest way is to sum up all the ratings and form an
unweighted composite score such as:

$$X_c = X_1 + X_2 + \cdots + X_k \tag{80}$$

The expected value of the composite score is given by:

$$E(X_c) = E(X_1 + X_2 + \cdots + X_k) \tag{81}$$

and the variance by:

$$\sigma_c^2 = E\left[(X_c - E(X_c))^2\right] \tag{82}$$

In the case where the expected value of the composite
score is zero, the variance becomes:

$$\sigma_c^2 = E(X_c^2) \tag{83}$$

Because ratings vary from one judge to another in the
assessment of a performance one could seek to understand the
relative contribution of each rater to the composite score.
A solution to this problem can be found by differentially
weighting each component and partialling out their
contribution to the total composite. The principal components
method of factoring can be used in such a situation.

To form the composite score, a set of linear weights
is applied to the components such as:

$$X_c = w_1 X_1 + w_2 X_2 + \cdots + w_k X_k \tag{84}$$

Similar to the unweighted case the variance of the weighted
composite is given by:

$$\sigma_c^2 = E(X_c^2) \tag{85}$$

In terms of matrix equation, a weighted composite variable
can be expressed by:

$$X_c = w'X \tag{86}$$

and the variance by

$$\sigma_c^2 = E(w'XX'w) \tag{87}$$

When the variables are expressed in standard-score form,
equation 87 becomes:

$$\sigma_c^2 = E(w'ZZ'w) \tag{88}$$

But the correlations between the variables are given in
matrix terms by:

$$R = E(ZZ') \tag{89}$$

So the variance of the weighted composite is stated as:

$$\sigma_c^2 = w'Rw \tag{90}$$

The last equation gives the solution for the variance
of a weighted composite when the components are differentially
weighted. However, any constant multiple of the weights
also gives a solution for the variance of the composite, so
that the variance can be made as large as desired simply by
rescaling the weights. In order to obtain a variance
of the composite that is a maximum but unique, it is
necessary to introduce a restriction. The solution is found
by setting the constraint that the sum of squares of the
weights used is 1 or in matrix terms:

$$w'w = 1 \tag{91}$$


The task consists of finding the maximum of a function. To
do so it is necessary to resort to differential calculus.
Furthermore, when a constraint is used the Lagrange multiplier
$\lambda$ is introduced for the solution of the function. So

$$F = w'Rw - \lambda(w'w - 1) \tag{92}$$

Taking the derivative of F with respect to w, we get:

$$\frac{\partial F}{\partial w} = 2Rw - 2\lambda w \tag{93}$$

and setting the result equal to zero, we obtain:

$$Rw - \lambda w = 0 \tag{94}$$

and

$$(R - \lambda I)w = 0 \tag{95}$$
This last equation is known as the characteristic equation
of the matrix R, and the roots $\lambda_1, \ldots, \lambda_k$ are known as the
characteristic roots, or latent roots or eigenvalues of the
matrix R. From equation 94 we have

$$Rw = \lambda w \tag{96}$$

and, premultiplying by $w'$ and using the constraint $w'w = 1$,

$$\lambda = w'Rw \tag{97}$$

But previously the variance of the composite was given by:

$$\sigma_c^2 = w'Rw \tag{98}$$


and would have to be equal to one of the latent roots
$\lambda_1, \ldots, \lambda_k$. Since the roots vary in magnitude, the
largest one would have to correspond to the maximum variance
of the weighted composite when the constraint (w'w = 1) is
introduced. Furthermore, the eigenvector associated with the
largest eigenvalue corresponds to the set of weights to be
applied to the variables to obtain the weighted composite.

By this method of factoring the sets of weights
or eigenvectors are orthogonal to one
another and the resulting components are mutually uncorrelated.

Previously the communality of a variable was defined
as the sum of the squared loadings on the common factors for
that variable. The loadings on the first factor are then
given by the product of the square root of the largest
eigenvalue and its associated eigenvector. The square root
of the second largest eigenvalue multiplied by its associated
vector forms the loadings on the second factor. The same
procedure is used to find the loadings on the subsequent
factors. Then the correlation matrix estimated by the first
factor is given by:

$$\hat{R} = (w_1 \lambda_1^{1/2})(w_1 \lambda_1^{1/2})' \tag{99}$$

since $R = W \Lambda W'$ by equation 97.
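The constrained maximum can be checked numerically on a small example. The sketch below is an illustration only: it assumes a hypothetical panel of five judges whose ratings all intercorrelate 0.8, a case in which equal weights are known to form the first eigenvector and the largest latent root equals 1 + (k - 1)r. All function and variable names are assumptions made for the example.

```python
import math


def mat_vec(R, w):
    """Matrix-vector product Rw for a square matrix given as nested lists."""
    return [sum(r_ij * w_j for r_ij, w_j in zip(row, w)) for row in R]


def composite_variance(R, w):
    """Variance w'Rw of a weighted composite of standardized ratings."""
    return sum(w_i * v_i for w_i, v_i in zip(w, mat_vec(R, w)))


def unit_length(w):
    """Rescale the weights so that w'w = 1."""
    norm = math.sqrt(sum(x * x for x in w))
    return [x / norm for x in w]


# Hypothetical panel: k = 5 judges, every intercorrelation equal to 0.8.
k, r = 5, 0.8
R = [[1.0 if i == j else r for j in range(k)] for i in range(k)]

# Equal weights of unit length: the first eigenvector of this matrix.
w_first = unit_length([1.0] * k)
largest_root = composite_variance(R, w_first)

# Rw = lambda * w should hold for this pair (largest deviation is reported).
gap = max(abs(a - largest_root * b)
          for a, b in zip(mat_vec(R, w_first), w_first))

# Another unit-length set of weights gives a smaller composite variance.
w_other = unit_length([1.0, 2.0, 3.0, 4.0, 5.0])
variance_other = composite_variance(R, w_other)

# Loadings on the first factor and the correlation they reproduce.
loadings = [math.sqrt(largest_root) * w for w in w_first]
reproduced_r12 = loadings[0] * loadings[1]

print(round(largest_root, 3), round(variance_other, 3), round(reproduced_r12, 3))
```

The unit-length restriction is what makes the comparison meaningful: without it the second set of weights could be rescaled until its composite variance exceeded the largest latent root.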


Number of factors. It was previously stated that 
one purpose of factor analysis was to retain the minimum 
number of factors that would best reproduce the original 
correlation matrix. To that effect decision rules were 
developed and compared (Hakstian and Muller, 1973). Among 
those rules, Guttman (1954, 1956) suggested three approaches
to estimate the communality of variables. From the definition
of communality:

$$h_j^2 = \sum_{p=1}^{m} a_{jp}^2 \tag{100}$$

the problem is defined as follows: how many factors are
necessary to account for the common variance among the
variables?

The first lower-bound estimate of communality as
defined by Guttman (1954, 1956), consists in factoring the
correlation matrix with 1's in the main diagonal. The latent
roots of the correlation matrix are computed and only those
equal to or greater than 1 are retained. These also correspond
to the number of factors to be retained.


Reliability estimated by factor analysis. A uni-factor
model to estimate the reliability of raters was proposed by
Overall (1965). The model is based on the assumption that
ratings made by several raters should be perfectly correlated
if it were not for independent errors of measurement. It
should be stated that raters may have consistent biases, but
the variance among raters would not be affected by those
biases. When the correlation matrix is factored by the
method of principal components only one latent root greater
than one should be observed and thus only one factor would
be necessary to account for the true variance (Laforge, 1965).


Recalling the definition of reliability based on the
factor analysis components, that is:

$$r_{xx} = h^2 + s^2 \tag{101}$$

then the reliability of each judge would be determined by
squaring the loadings on the first factor. This assumption
would be true only if the uni-factor model holds, that is only
one latent root greater than one is observed. In this case
loadings on subsequent factors would account for error
variance or unreliability of the judges.
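The rule for obtaining judge reliabilities under the uni-factor model can be sketched as follows. The sketch assumes that the latent roots and the first eigenvector have already been obtained by a principal components routine; here they are filled in from the known closed form for a hypothetical panel of five judges with all intercorrelations equal to 0.9. The function names are illustrative and are not taken from Overall (1965).

```python
import math


def unifactor_holds(latent_roots):
    """True when exactly one latent root is greater than one."""
    return sum(1 for root in latent_roots if root > 1.0) == 1


def judge_reliabilities(first_root, first_vector):
    """Squared loadings on the first factor, one value per judge.

    The loading of a judge is sqrt(first_root) times that judge's
    element of the unit-length first eigenvector.
    """
    return [first_root * w * w for w in first_vector]


# Hypothetical panel: k judges, all intercorrelations equal to r.
k, r = 5, 0.9
latent_roots = [1 + (k - 1) * r] + [1 - r] * (k - 1)
first_vector = [1 / math.sqrt(k)] * k

if unifactor_holds(latent_roots):
    reliabilities = judge_reliabilities(latent_roots[0], first_vector)
    print([round(x, 3) for x in reliabilities])
```

When more than one latent root exceeds one, the squared first-factor loadings should not be read as reliabilities, since the later factors then carry more than error variance.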
SUMMARY 


In the area of gymnastic judging four commonly used 
approaches to assess the quality of judging were revealed: 
(1) Gross scores versus net scores; (2) Range of scores
versus quality of judging; (3) Intercorrelations; and (4) 
Bauer method versus FIG method. 

It was felt by the author that more research was 
needed to estimate the reliability of the ratings and the 
reliability of the raters of gymnastic competitions. From 
the classical test theory model, the reliability of parallel
tests was given by:

$$r_{xx} = \frac{\sigma_t^2}{\sigma_x^2}$$

and defined as the ratio of true score variance to observed
score variance.

Several procedures to estimate the reliability of 
parallel tests by split halves methods were presented (Rulon, 
1939; Guttman, 1945; Kuder-Richardson, 1937; Cronbach, 1951). 

Many authors have pointed out the use of the analysis 
of variance techniques in estimating the reliability of 
ratings. Winer's approach (1971) was presented and the
following four equations to estimate the reliability of
ratings were derived:

1. unadjusted estimate of the average reliability
of the ratings:

$$r_1 = \frac{MS_S - MS_W}{MS_S + (K-1)MS_W}$$

2. unadjusted estimate of the reliability of the
average ratings:

$$r_K = \frac{MS_S - MS_W}{MS_S}$$

3. adjusted estimate of the average reliability
of the ratings:

$$r_1 = \frac{MS_S - MS_{RES}}{MS_S + (K-1)MS_{RES}}$$

4. adjusted estimate of the reliability of the
average ratings:

$$r_K = \frac{MS_S - MS_{RES}}{MS_S}$$

Finally, the procedures for estimating the reliability
of raters were given. A uni-factor model (Overall, 1965)
derived from the principal components method of factoring
was presented. From that model the reliability of a rater
was defined as

$$r_{xx} = h^2 + s^2$$

and was obtained by squaring the loadings accounted for by
each judge on the first factor. It was assumed that loadings
on subsequent factors would account for error variance in
the case where the uni-factor model held.

CHAPTER III
RESULTS AND DISCUSSION


The gymnastic ratings obtained at four different
levels of competition were analysed. The competitions 
were: (1) the Western Canadian Intercollegiate Athletic 
Association (WCIAA) meet held in Edmonton in 1971; (2) 
the Canadian National trials held in Winnipeg in 1973; (3) 
the Canada-China meet held in Montreal in 1973; and (4) the 
Canadian National meet held in Toronto in 1973. Altogether 
fifty-four events were analysed. 

The first purpose of this study was to estimate the
reliability of the gymnastic ratings. Based on the analysis
of variance techniques (Ebel, 1951; Winer, 1971) the
unadjusted coefficient of the average reliability of the
ratings was obtained by using

$$r_1 = \frac{MS_S - MS_W}{MS_S + (K-1)MS_W} \tag{1}$$

When the "between-judges" variance was removed from the
error term, an adjusted coefficient of the average
reliability of the ratings was obtained. The following
equation was used:

$$r_1 = \frac{MS_S - MS_{RES}}{MS_S + (K-1)MS_{RES}} \tag{2}$$

In order to estimate the reliability of the average
ratings, the Spearman-Brown formula was used. For the
unadjusted reliability of the average ratings the equation
was:

$$r_K = \frac{K r_1}{1 + (K-1) r_1} \tag{3}$$

with $r_1$ taken from equation 1, and for the adjusted
reliability of the average ratings:

$$r_K = \frac{K r_1}{1 + (K-1) r_1} \tag{4}$$

with $r_1$ taken from equation 2.
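The computation of the four coefficients from a table of ratings can be sketched as follows. This is a minimal illustration, assuming a complete athletes-by-judges table with no missing scores; the function names, the dictionary keys and the small ratings table are hypothetical, and the mean squares follow the usual one-way and two-way partitions of the sums of squares.

```python
def spearman_brown(r1, k):
    """Reliability of the average of k ratings from the average reliability r1."""
    return k * r1 / (1 + (k - 1) * r1)


def rating_reliabilities(ratings):
    """ratings[i][j] is the score given to athlete i by judge j."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_subjects = k * sum((m - grand) ** 2 for m in row_means)
    ss_within = sum((x - m) ** 2
                    for row, m in zip(ratings, row_means) for x in row)
    ss_judges = n * sum((m - grand) ** 2 for m in col_means)
    ss_residual = ss_within - ss_judges       # "between-judges" part removed

    ms_s = ss_subjects / (n - 1)
    ms_w = ss_within / (n * (k - 1))
    ms_res = ss_residual / ((n - 1) * (k - 1))

    r1 = (ms_s - ms_w) / (ms_s + (k - 1) * ms_w)
    r1_adj = (ms_s - ms_res) / (ms_s + (k - 1) * ms_res)
    return {
        "r1_unadjusted": r1,
        "rK_unadjusted": spearman_brown(r1, k),
        "r1_adjusted": r1_adj,
        "rK_adjusted": spearman_brown(r1_adj, k),
    }


# Hypothetical table: three athletes, two judges; the second judge scores
# every athlete exactly half a point higher than the first.
example = [[8.0, 8.5],
           [9.0, 9.5],
           [7.0, 7.5]]
print(rating_reliabilities(example))
```

In this example the judges differ only by a constant, so the adjusted coefficients reach one while the unadjusted ones stay below one; this is the distinction the "between-judges" adjustment is meant to capture.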


THE RELIABILITY OF THE RATINGS 


The estimated coefficients of the reliability of 
the ratings as obtained by equations 1, 2, 3 and 4 are 
presented in Table 1. For the purpose of identifying each 


event the following abbreviations were used: 


F    - for the floor exercises event
H.B. - for the horizontal bar event
P.B. - for the parallel bars event
P.H. - for the pommel horse event
R    - for the rings event
V    - for the vault event


Intercollegiate Meet: Edmonton, 1971 


The lowest unadjusted average reliability of the
ratings was observed in the parallel bars event (0.811) and
the highest in the floor exercises event (0.902). When,
for all six events, the unadjusted reliability of the
ratings was adjusted by removing from the error term the
"between-judges" variance, all coefficients ranged between
0.818 and 0.913. In order to estimate the reliability of
the average ratings the Spearman-Brown formula was used,
and it was found that all estimated coefficients of
reliability (adjusted and unadjusted) were above 0.945.


TABLE 1

Reliability of the Ratings Obtained From
the Analysis of Variance

                         Unadjusted                Adjusted
Events  N            R1           RK           R1           RK

Intercollegiate (WCIAA) Edmonton 1971

F       26           0.902        0.979        0.905        0.979
H.B.    24           0.882        0.974        0.913        0.981
P.B.    [illegible]  0.811        0.945        0.818        0.947
P.H.    28           0.824        0.949        0.824        0.949
R       [illegible]  0.898        0.978        0.907        0.980
V       24           0.867        0.970        0.897        0.977

National Trials: Winnipeg 1973

F       6            0.814        0.956        0.812        0.956
H.B.    6            0.550        0.859        0.689        0.917
P.B.    6            0.670        0.910        0.668        0.909
P.H.    6            0.840        0.963        0.817        0.957
R       6            0.863        0.969        0.857        0.968
V       6            0.520        0.844        0.492        0.829

Canada-China Meet: Montreal 1973

F       12           0.727        0.930        0.755        0.939
H.B.    12           0.913        0.981        0.906        0.980
P.B.    12           0.896        [illegible]  0.895        0.977
P.H.    12           0.896        [illegible]  0.920        0.983
R       12           0.947        0.989        0.951        0.990
V       12           0.762        0.941        0.775        0.945

National Meet: Senior Men's Finals Toronto 1973

F       [illegible]  0.342        0.722        0.399        0.769
H.B.    4            0.851        0.966        0.882        0.974
P.B.    6            0.704        0.923        0.673        0.911
P.H.    7            0.856        0.967        0.862        0.969
R       7            0.626        0.893        0.647        0.902
V       7            0.789        0.949        0.800        0.952

National Meet: Senior Men's Compulsories Toronto 1973

F       17           0.888        0.975        0.889        0.976
H.B.    17           0.942        0.987        0.941        0.988
P.B.    17           0.920        0.983        0.927        0.985
P.H.    17           0.947        0.989        0.956        0.991
R       17           0.896        0.972        0.893        0.971
V       6            0.650        0.903        0.646        0.901

National Meet: Senior Men's Optionals Toronto 1973

F       17           0.890        0.970        0.889        0.970
H.B.    17           0.836        0.962        0.840        0.963
P.B.    17           0.902        0.974        0.901        0.973
P.H.    17           0.936        0.987        0.937        0.987
R       17           0.873        0.965        0.874        0.965
V       17           0.667        0.889        0.676        0.893

National Meet: Junior Men's Finals Toronto 1973

F       6            0.590        0.878        [illegible]  0.866
H.B.    6            0.706        0.923        [illegible]  0.946
P.B.    6            0.866        0.970        0.860        0.968
P.H.    6            0.814        0.956        0.799        0.952
R       6            0.577        0.872        [illegible]  0.860
V       6            0.706        0.923        0.695        0.919

National Meet: Junior Men's Compulsories Toronto 1973

F       22           0.761        0.941        0.800        0.952
H.B.    20           0.936        0.987        0.936        0.986
P.B.    23           0.772        0.944        0.775        0.945
P.H.    21           0.843        0.964        0.874        0.972
R       22           0.905        0.974        0.910        0.976
V       [illegible]  [illegible]  0.962        0.848        0.965

National Meet: Junior Men's Optionals Toronto 1973

F       22           0.888        0.969        0.887        0.969
H.B.    [illegible]  0.859        0.968        0.871        0.971
P.B.    25           0.882        0.968        0.881        0.967
P.H.    22           0.855        0.967        0.855        0.967
R       22           0.814        0.946        0.835        0.953
V       [illegible]  0.880        0.967        0.880        0.967


National Trials: Winnipeg, 1973 


For the Canadian National trials, only six athletes 
were participating. The lowest coefficients for the average 
reliability of the ratings were observed in the horizontal 
bar event (0.550), the parallel bars event (0.670) and the 
vault event (0.520). For the other three events all 
coefficients were above 0.810. Similar results were 
observed when the coefficients for the average reliability 
of the ratings were adjusted. After the Spearman-Brown 
transformation, the coefficients of the reliability of the 


average ratings ranged between 0.829 and 0.969. 


Canada-China Meet: Montreal, 1973 

Six athletes from Canada and six athletes from China 
participated in the Canada-China gymnastic competition. The 
analysis showed that two coefficients for the average 
reliability of the ratings were below 0.80 (floor exercises, 
0.727; vault, 0.762). Similar results were also observed 


for the adjusted coefficients. In relation to the reliability
of average ratings, all coefficients, unadjusted and
adjusted, were above 0.930.


National Meet: Senior Men 

The six best athletes from the Compulsory and 
Optional competitions were selected to participate in the 
Finals. For that competition, a very low coefficient for 
the average reliability of the ratings was observed in the 
floor exercises event (0.342). For the other events the 
range for the same coefficient of reliability was between 
0.626 and 0.856. The Spearman-Brown transformation yielded 
adjusted and unadjusted coefficients of reliability below 
0.80 for the floor exercises event only. 

From the ratings of the Compulsory competition, the 
average reliability of the ratings for the vault event was 
found to be 0.650 for the unadjusted coefficient and 0.646 
for the adjusted coefficient. For all other events the 
reliability coefficients were above 0.888. The coefficients 
for the reliability of average ratings were above 0.90 for 
all six events. 

Very similar results were observed in the ratings 
of the Optional competition. The lowest coefficients of 


reliability of the ratings were found in the vault event. 


National Meet: Junior Men 


In the Final competition, very low coefficients 


for the average unadjusted reliability of the ratings were
observed in the floor exercises event (0.590) and the rings
event (0.577). For the vault and the horizontal bar events,
the average reliability of the ratings was estimated at
0.706 and was above 0.800 for the other two events. The
removal of the "between-judges" variance from the error 
variance yielded adjusted coefficients very similar to the 
unadjusted ones. The coefficients for the reliability of 
the average ratings, adjusted and unadjusted, were all above 
0.860. 

In the Compulsory competition the analysis of the 
average reliability of the ratings yielded coefficients that 
were all above 0.761. After the Spearman-Brown transformation 
all coefficients for the reliability of the average ratings 
were found to be above 0.941. 

For the Optional competition all coefficients of the 
average reliability of the ratings and the reliability of the 


average ratings were above 0.810 for all events. 


Discussion. The average reliability of the ratings 
gives an estimation of the level of agreement in the ratings 
as awarded by the different judges. However, the estimated 
reliability of the average ratings represents the extent to 
which the judges agreed collectively and also indicates the 
extent to which another panel of judges would have agreed 
in its ratings of the same performances (Akeju, 1972). 

As was previously indicated, Ebel (1951) stated
that the "between-judges" variance should be part of the
error term for assessing the average reliability of the
ratings when decisions are made by comparing average scores
which come from different groups of raters. In gymnastic
competition this kind of decision is often made. For example, 
the six best athletes who performed at the National Meet 
in the Compulsory and Optional competitions were selected 
to participate in the Final competition according to their 
average scores for all the six events in the previous 
competitions. It would therefore be of great value if some 
standards relating to the quality of the ratings were 
available. This is not the case at the present time. So 
the author will attempt to establish arbitrarily, standards 
based upon the average reliability of the ratings. 

It should be mentioned, as a general observation,
that the higher the level of competition the lower the 
coefficient of the average reliability of the ratings. At 
the National Meet, these reliability coefficients were higher 
in the Compulsory and Optional competitions than they were 
in the Final competition in which the six best athletes from 
the previous two competitions competed. The same conclusion 
was observed where the coefficients of the average reliability 
of the ratings were higher for the Intercollegiate Meet than 
for the National Trials and the Finals (Senior and Junior 
Men) at the National Meet. However, this general principle
does not apply to the coefficients of reliability obtained
at the Canada-China Meet. Even though superior calibre athletes
participated, high coefficients of the average reliability
of the ratings were observed.

It is therefore suggested that in order to quantify 
the quality of gymnastic ratings the following standards 
be used. When the level of competition is high, such as 
the National trials or the Finals of a National Meet, a 
coefficient of the average reliability is evaluated to be 
excellent if it is above 0.80. For competitions where a 
greater range is observed in the calibre of the athletes, such 
as at an intercollegiate meet, or the Compulsory and Optional 
competitions at a National Meet, the standard of excellence 
for the agreement of the ratings between different judges 
could be set at 0.90.

Evidently, other standards could also be established 
for lower levels of the extent of agreement between the 
ratings of the judges. 

If these standards, mentioned above, were applied to
the results presented in Table 1, the following events would
receive the standard of excellent agreement between the judges
in their ratings:

1. Intercollegiate Meet
      Floor exercises     0.902

2. National Trials
      Floor exercises     0.814
      Pommel horse        0.840
      Rings               0.863

3. Canada-China Meet
      Horizontal bar      0.913
      Parallel bars       0.896
      Pommel horse        0.896
      Rings               0.947

4. National Meet: Senior Men's Finals
      Horizontal bar      0.851
      Pommel horse        0.856

5. National Meet: Senior Men's Compulsories
      Horizontal bar      0.942
      Parallel bars       0.920
      Pommel horse        0.947

6. National Meet: Senior Men's Optionals
      Parallel bars       0.902
      Pommel horse        0.936

7. National Meet: Junior Men's Finals
      Parallel bars       0.866
      Pommel horse        0.814

8. National Meet: Junior Men's Compulsories
      Horizontal bar      0.936
      Rings               0.905

9. National Meet: Junior Men's Optionals
      none

In total, nineteen of the fifty-four events rated
received the standard of excellent agreement when using the
scale defined above.


THE RELIABILITY OF THE JUDGES


As a second purpose, this study attempted to
estimate the reliability of each judge. The reliability
obtained by factor analysis was previously defined by:

$$r_{xx} = h^2 + s^2 \tag{5}$$

In order to estimate the reliability of each judge,
a principal components factor analysis was computed for each
of the fifty-four sets of ratings. A uni-factor model
(Overall, 1965) was assumed and the reliability of each
judge was obtained by squaring that judge's loadings on the
first factor. It was assumed that the loadings on subsequent
factors would account for error variance.

For each event and for each competition, the
reliability coefficients of the superior judge (S.J.) and the
four other judges are presented in Table 2.


Intercollegiate Meet: Edmonton, 1971


From the results presented in Table 2, it was observed
that the lowest reliability was obtained by Judge 2 in the
pommel horse event (0.831). Out of twenty-eight coefficients
of individual judge reliability, seventeen were above 0.900
and eleven were between 0.800 and 0.899.

It must be noted that the judging panel was not
necessarily composed of the same persons from one event
to another.

TABLE 2

Reliability of the Judges Estimated by
Principal Components Analysis

Event   S.J.         1            2            3            4

Intercollegiate (WCIAA) Edmonton 1971

F       0.973        0.836        0.930        0.952        0.952
H.B.    0.963        0.952        0.952        [illegible]  0.927
P.B.    0.873        0.896        0.844        [illegible]
P.H.    0.918        0.871        0.831        0.891
R       0.964        0.872        0.912        0.943        0.939
V       [illegible]  0.944        0.883        [illegible]  0.945

National Trials: Winnipeg 1973

F       0.981        [illegible]  0.784        [illegible]  0.862
H.B.    0.947        0.561        0.842        0.807        [illegible]
P.B.    0.788        0.929        0.546        [illegible]  0.750
P.H.    0.924        0.903        0.958        0.929        0.738
R       0.951        0.958        0.895        0.986        0.653
V       0.910        0.688        0.261        0.961        0.985

Canada-China Meet: Montreal 1973

F       0.800        [illegible]  0.909        0.900        0.788
H.B.    [illegible]  0.954        0.979        0.965        0.915
P.B.    0.930        0.943        0.930        0.921        0.958
P.H.    0.976        0.941        0.948        0.942        0.980
R       0.976        0.985        0.958        0.984        0.920
V       0.684        0.839        0.826        0.928        0.808

National Meet: Senior Men's Finals Toronto 1973

F       0.734        0.542        0.303        0.688        0.810
H.B.    0.865        0.993        0.931        0.962        0.949
P.B.    0.803        0.892        0.792        0.515        0.824
P.H.    0.992        0.934        0.754        0.909        0.955
R       0.842        0.771        0.666        0.690        0.907
V       0.920        0.726        0.895        0.903        0.933

National Meet: Senior Men's Compulsories Toronto 1973

F       0.944        0.923        0.957        0.914        0.849
H.B.    0.956        0.948        0.956        0.964        0.965
P.B.    0.985        0.976        0.899        0.965        0.953
P.H.    0.984        0.970        0.965        0.990        0.981
R       0.939        0.919        0.895        0.935
V       0.816        0.866        0.512        0.628        0.837

National Meet: Senior Men's Optionals Toronto 1973

F       0.948        0.955        0.838        0.934
H.B.    0.914        0.933        [illegible]  0.967        [illegible]
P.B.    0.970        0.894        0.932        0.928
P.H.    0.949        0.978        0.959        [illegible]  0.934
R       0.962        0.920        0.907        0.880
V       0.801        0.769        0.845        0.736

National Meet: Junior Men's Finals Toronto 1973

F       0.804        0.649        0.714        0.852        0.523
H.B.    0.959        0.933        0.752        0.734        0.895
P.B.    0.980        [illegible]  0.967        0.938        0.683
P.H.    0.937        0.921        0.599        0.939        0.857
R       0.959        0.743        0.314        0.879        0.646
V       0.886        0.433        0.913        0.977        0.806

National Meet: Junior Men's Compulsories Toronto 1973

F       0.926        0.815        0.865        [illegible]  0.880
H.B.    0.974        0.952        0.925        [illegible]  0.946
P.B.    0.897        0.960        0.788        [illegible]  0.741
P.H.    0.955        0.925        0.903        0.908        0.928
R       0.925        0.948        0.946        0.917
V       0.925        0.893        0.920        0.943        [illegible]

National Meet: Junior Men's Optionals Toronto 1973

F       0.941        0.943        0.868        0.917
H.B.    0.902        0.956        0.846        0.895        0.927
P.B.    0.917        0.949        0.912        0.900
P.H.    0.892        0.947        0.922        0.859        0.825
R       0.936        0.929        0.867        0.849
V       0.962        0.900        0.899        0.898


National Trials: Winnipeg, 1973
For this competition the lowest coefficients of
individual judge reliability were obtained by Judge 2 in the
vault event (0.261), Judge 2 in the parallel bars event
(0.546) and Judge 1 in the horizontal bar event (0.561).
Out of thirty coefficients, fourteen were above 0.900, five
were between 0.800 and 0.899, six were between 0.700 and
0.799 and five were below 0.700.


Canada-China Meet: Montreal, 1973 

In this competition the lowest individual judge 
reliability was obtained by the superior judge in the vault 
event. Of the thirty coefficients estimated, twenty-three 
were above 0.900, four were between 0.800 and 0.899, two 


were between 0.700 and 0.799 and one was below 0.700. 


National Meet: Senior Men's Finals 

For this competition the lowest coefficients of 
individual judge reliability were obtained by Judge 2 in the 
floor exercises event (0.303), Judge 3 in the parallel bars 
event (0.515) and Judge 1 in the floor exercises event (0.542). 
From the thirty coefficients obtained, twelve were above 0.900, 
seven were between 0.800 and 0.899, five were between 0.700 


and 0.799 and six were below 0.700. 


National Meet: Senior Men's Compulsories 


Out of twenty-nine coefficients of individual judge
reliability for this competition, twenty-one were above
0.900, six were between 0.800 and 0.899 and two were below
0.700. The lowest coefficients were observed in the vault
event for Judge 2 (0.512) and Judge 3 (0.628).


National Meet: Senior Men's Optionals 


The lowest coefficient of individual judge reliability 
was obtained by Judge 3 in the vault event. Of the twenty- 
six coefficients obtained, eighteen were above 0.900, five 
were between 0.800 and 0.899 and three were between 0.700 


and 0.799. 


National Meet: Junior Men's Finals 

In this competition, thirty coefficients of individual 
judge reliability were obtained. Eleven were above 0.900, 
eight were between 0.800 and 0.899, four were between 0.700 
and 0.799 and seven were below 0.700. The lowest 


coefficient was observed for Judge 2 in the rings event 


(0.314). 


National Meet: Junior Men's Compulsories 


For this competition the lowest individual reliability 
coefficient was obtained by Judge 4 in the vault event. 
Out of the twenty-nine coefficients obtained, nineteen were 
above 0.900, six were between 0.800 and 0.899 and four were 
between 0.700 and 0.799. 



National Meet: Junior Men's Optionals 

Of the twenty-six coefficients of individual judge 
reliability, the lowest one was obtained by Judge 4 in the 
pommel horse event (0.825). Sixteen coefficients were above 
0.900 and ten were between 0.800 and 0.899. 

In summary, 258 coefficients of individual judge 
reliability were computed: 151 were above 0.900 (58.5%), 
62 were between 0.800 and 0.899 (24.0%), 24 were between 
0.700 and 0.799 (9.3%) and 21 were below 0.700 (8.1%). 
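
The percentages in this summary follow directly from the four counts. As a quick arithmetic check (the variable names are illustrative and not part of the original analysis):

```python
# Counts of individual judge reliability coefficients reported in the summary.
counts = {
    "above 0.900": 151,
    "0.800 to 0.899": 62,
    "0.700 to 0.799": 24,
    "below 0.700": 21,
}

total = sum(counts.values())  # should equal the 258 coefficients computed

# Percentage of the total falling in each band, rounded to one decimal.
percentages = {band: round(100 * n / total, 1) for band, n in counts.items()}

for band, pct in percentages.items():
    print(f"{band}: {pct}%")
```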


The Uni-Factor Model 

It was also part of the individual judge reliability 
problem to test for the uni-factor model (Overall, 1965). 
The first lower bound estimate of communality as defined by 
Guttman (1954, 1956) was used as the decision rule for the 
number of factors to retain, which corresponds to the number 
of eigenvalues greater than one. 

For each event and for each competition, the largest 
eigenvalue ($\lambda_1$), the second largest eigenvalue 
($\lambda_2$), the proportion of the total variance accounted 
for by the largest eigenvalue ($\%\sigma^2_1$) and the 
proportion of the total variance accounted for by the second 
largest eigenvalue ($\%\sigma^2_2$) are presented in Table 3. 

For all fifty-four principal components analyses 
performed, only one factor was retained since there was no 
more than one eigenvalue greater than one for each analysis. 
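
The decision rule and the proportions reported in Table 3 can be illustrated with a short sketch. It assumes that the principal components analysis was run on a correlation matrix with unities in the diagonal, so that the eigenvalues sum to the number of judges. The function names are hypothetical, and the three smaller eigenvalues of the example are not given in the source.

```python
def factors_to_retain(eigenvalues):
    """Number of eigenvalues greater than one (the decision rule used above)."""
    return sum(1 for value in eigenvalues if value > 1.0)


def percent_of_variance(eigenvalue, n_judges):
    """Share of total variance for one eigenvalue of an n_judges x n_judges
    correlation matrix, whose eigenvalues sum to n_judges."""
    return 100.0 * eigenvalue / n_judges


# Two largest eigenvalues reported in Table 3 for the floor exercises event
# of the Intercollegiate Meet (five judges).
lambda_1, lambda_2 = 4.644, 0.204

print(factors_to_retain([lambda_1, lambda_2]))     # number of factors retained
print(round(percent_of_variance(lambda_1, 5), 2))  # close to the tabled 92.879
print(round(percent_of_variance(lambda_2, 5), 2))  # close to the tabled 4.086
```

The small differences from the tabled percentages come from the eigenvalues being printed to three decimals only.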

TABLE 3 

Eigenvalues and Proportion of Variance. 
A Test of the Uni-Factor Model 

Events    $\lambda_1$    $\lambda_2$    $\%\sigma^2_1$    $\%\sigma^2_2$ 

Intercollegiate Meet: Edmonton 1971 
F         4.644          0.204          92.879            4.086 
H.B.      4.689          0.143          93.781            2.862 
P.B.      3.485*         0.244          87.119            6.104 
P.H.      3.510*         0.248          87.751            6.203 
R         4.629          0.166          92.572            [illegible] 
V         4.610          0.171          92.205            3.424 

National Trials: Winnipeg 1973 
F         4.423          0.304          88.451            6.080 
H.B.      3.918          0.668          78.362            13.369 
P.B.      3.781          0.812          75.625            16.232 
P.H.      4.452          0.455          89.037            9.102 
R         4.442          0.481          88.841            9.626 
V         3.805          0.988          76.095            19.765 

Canada-China Meet: Montreal 1973 
F         4.112          0.434          82.233            8.676 
H.B.      4.730          0.170          94.595            3.399 
P.B.      4.683          0.124          93.665            2.486 
P.H.      4.787          0.110          95.743            2.193 
R         4.822          0.104          96.447            2.078 
V         4.084          0.444          81.675            8.880 

Senior Men's Finals: Toronto 1973 
F         3.078          0.679          61.568            [illegible] 
H.B.      4.700          0.223          94.007            4.459 
P.B.      3.932          0.725          76.534            14.503 
P.H.      4.544          0.334          90.888            6.680 
R         3.875          0.629          77.507            12.584 
V         4.377          0.389          87.536            [illegible] 

Senior Men's Compulsories: Toronto 1973 
F         4.586          0.225          [illegible]       4.499 
H.B.      4.789          0.102          [illegible]       2.047 
P.B.      4.778          0.136          95.564            [illegible] 
P.H.      4.890          0.047          97.809            0.944 
R         [illegible]*   0.142          [illegible]       [illegible] 
V         3.659          0.626          73.174            [illegible] 

Senior Men's Optionals: Toronto 1973 
F         3.676*         0.212          91.897            [illegible] 
H.B.      4.500          0.288          89.920            [illegible] 
P.B.      3.724*         0.164          [illegible]       4.112 
P.H.      4.758          [illegible]    95.160            [illegible] 
R         [illegible]*   0.189          [illegible]       4.722 
V         [illegible]*   0.401          [illegible]       10.015 

Junior Men's Finals: Toronto 1973 
F         3.540          0.847          70.810            16.943 
H.B.      4.273          0.377          85.455            [illegible] 
P.B.      4.446          0.434          88.916            8.673 
P.H.      4.253          0.552          85.051            11.044 
R         3.541          0.961          70.817            19.219 
V         4.015          0.663          80.293            13.269 

Junior Men's Compulsories: Toronto 1973 
F         4.273          0.338          85.465            [illegible] 
H.B.      4.768          0.101          95.363            2.010 
P.B.      4.218          0.319          [illegible]       [illegible] 
P.H.      4.620          0.149          92.407            2.976 
R         3.735*         0.111          93.368            2.785 
V         4.404          0.348          88.083            6.967 

Junior Men's Optionals: Toronto 1973 
F         3.670*         0.221          91.741            5.516 
H.B.      4.522          0.199          90.441            3.982 
P.B.      3.678*         0.158          91.949            3.951 
P.H.      4.445          0.241          88.908            4.815 
R         3.580*         0.212          89.504            5.295 
V         3.660*         0.160          91.494            3.989 

*Obtained from the ratings of four judges. 

For all the events and for each competition, the mean 
of the proportion of the total variance accounted for by the 
largest eigenvalue was as follows: 

1. Intercollegiate Meet 91.052% 
2. National Trials 82.735% 
3. Canada-China Meet 90.726% 
4. National Meet: Senior Men's Finals 81.340% 
5. National Meet: Senior Men's Compulsories 91.037% 
6. National Meet: Senior Men's Optionals 90.111% 
7. National Meet: Junior Men's Finals 80.223% 
8. National Meet: Junior Men's Compulsories 89.842% 
9. National Meet: Junior Men's Optionals 90.673% 


Discussion. From the individual judge reliability 
coefficients presented in Table 2, the average judge 
reliability coefficient was obtained for each event in each 
competition and these averages are presented in Table 4. 
Since each individual judge reliability coefficient was 
determined by squaring the loading of each judge on the first 
factor, the average judge reliability coefficients presented 
in Table 4 correspond to the proportion of the total variance 
accounted for by the largest eigenvalue in each event. These 
proportions are shown in Table 3. 
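
The relation stated above, that the mean of the squared first-factor loadings equals the proportion of variance accounted for by the largest eigenvalue, can be checked numerically. The sketch below uses a made-up correlation matrix for four judges and a simple power iteration; neither the data nor the routine comes from the study.

```python
import math


def largest_eigenpair(R, iterations=500):
    """Power iteration: largest eigenvalue and unit-length eigenvector
    of a symmetric matrix R given as a list of rows."""
    n = len(R)
    w = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        v = [sum(R[i][j] * w[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        w = [x / norm for x in v]
    # Rayleigh quotient w'Rw gives the eigenvalue for the converged vector.
    lam = sum(w[i] * R[i][j] * w[j] for i in range(n) for j in range(n))
    return lam, w


# Hypothetical correlations among the ratings of four judges.
R = [
    [1.00, 0.90, 0.85, 0.80],
    [0.90, 1.00, 0.88, 0.82],
    [0.85, 0.88, 1.00, 0.79],
    [0.80, 0.82, 0.79, 1.00],
]

lam, w = largest_eigenpair(R)

# Loading of each judge on the first factor, and its square
# (the individual judge reliability in the sense used above).
loadings = [math.sqrt(lam) * x for x in w]
reliabilities = [a * a for a in loadings]

mean_reliability = sum(reliabilities) / len(reliabilities)
proportion = lam / len(R)  # share of total variance for the largest eigenvalue

print([round(r, 3) for r in reliabilities])
print(round(mean_reliability, 6), round(proportion, 6))
```

Because the weights have unit length, the squared loadings sum to the eigenvalue, so the two printed figures agree for any correlation matrix.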

TABLE 4 

Average Judge Reliability for Each Event 
in Each Competition 

Event     Edmonton       Winnipeg       Montreal 
F         0.928          0.884          0.822 
H.B.      [illegible]    0.783          0.946 
P.B.      [illegible]    0.756          0.936 
P.H.      0.877          0.890          0.957 
R         0.926          0.888          0.964 
V         0.921          [illegible]    0.817 

National Meet: Senior Men 
Event     Finals         Compulsories   Optionals 
F         0.615          0.917          0.918 
H.B.      0.940          0.957          0.899 
P.B.      0.765          0.955          0.931 
P.H.      0.908          0.978          0.952 
R         0.775          0.922          0.917 
V         0.875          0.731          0.787 

National Meet: Junior Men 
Event     Finals         Compulsories   Optionals 
F         0.708          0.854          0.917 
H.B.      0.854          0.953          0.904 
P.B.      0.889          0.843          0.919 
P.H.      0.850          0.923          0.889 
R         0.708          0.934          0.895 
V         0.803          0.880          0.914 

It should be pointed out that the coefficients for the 
average judge reliability as obtained by factor analysis are 
slightly higher than the coefficients of the average 
reliability of the ratings (Ry unadjusted) estimated by the 
analysis of variance. In the analysis of variance the 
observed ratings of the judges are manipulated without any 
differential weighting (Burt, 1955) whereas differential 
weights are used in factor analysis. A comparison of the 
reliability coefficients obtained by these two techniques 
was made by Mahmoud (1955), who suggested that 
the coefficient obtained by factor analysis "provides the most 
appropriate estimate of reliability". It was also suggested, 
in situations where non-negative weights are found, that the 
true components would contribute to a greater proportion of 
the total variance. Consequently a greater estimate of the 
reliability coefficient would be observed. 

As was done with the average reliability of the 
ratings, standards of excellence could be established for the 
individual judge reliability coefficients. Similarly we 
could arbitrarily select a coefficient of 0.90 and above 
to qualify an excellent rater in situations where a large 
range of ability exists among the athletes. Coefficients 
between 0.80 and 0.89 could be selected to define an excellent 
rater where the range of athletes' ability is much narrower. 
When these standards are applied to the results presented in 
Table 2, we find that 175 judges (67.8%) met the standard 
of excellent rater. 

Furthermore, the quality of the judging of an event 
could be assessed by using the proportion of the total 
variance accounted for by the largest eigenvalue extracted 
from the factorial analysis. As before, 80% of the total 
variance could be selected to represent excellent judging 
in situations where the athletes are more homogeneous in 
terms of their abilities, and 90% to represent excellent 
judging of competitions where the athletes are more 
heterogeneous. When these standards were applied to the 
results presented in Table 3, it was found that eight events 
out of twenty-four were below the standard of 80% for the 
National Trials, the Canada-China Meet, and the finals for the 
Junior and Senior Men at the National Meet. In the other 
competitions analysed, ten events out of thirty were 
below the standard of 90% to represent excellent judging. 

The same standards could also be applied to a 
competition as a whole. In this case, it is suggested that 
the mean of the proportion of variance accounted for by the 
six events in the same competition be used. (These averages 
were previously presented on page 61). 

Applying these standards to the nine competitions 
analysed, only the compulsory competition for the Junior 
Men at the National Meet fell below the standard of 90%. So, 
for eight of the nine competitions the standard of excellent 
judging can be applied to the ratings. 
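
As an illustration of how these standards could be applied, the sketch below takes the six proportions reported in Table 3 for the National Trials, lists the events falling below the 80% standard, and computes the competition mean. The function name and the layout of the data are assumptions made for the example.

```python
def below_standard(proportions, standard):
    """Events whose proportion of variance falls below the chosen standard."""
    return [event for event, pct in proportions.items() if pct < standard]


# Proportion of total variance accounted for by the largest eigenvalue,
# National Trials (Table 3).
national_trials = {
    "F": 88.451, "H.B.": 78.362, "P.B.": 75.625,
    "P.H.": 89.037, "R": 88.841, "V": 76.095,
}

weak_events = below_standard(national_trials, standard=80.0)
competition_mean = sum(national_trials.values()) / len(national_trials)

print(weak_events)                 # events below the 80% standard
print(round(competition_mean, 3))  # mean proportion for the competition
```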

PERFORMANCE SCORE ASSESSMENT METHODS 

It was also a purpose of this study to compare four 
methods of assessing the performance score of each athlete. 

The first method dealt with the weighted composite 
score. As previously stated, the characteristic equation of 
the correlation matrix R was given by: 

$$(R - \lambda I)\,w = 0 \tag{6}$$

Since the variance of the weighted composite was given by: 

$$\sigma_c^2 = w' R w \tag{7}$$

and that 

$$\lambda = w' R w \tag{8}$$

then the largest eigenvalue of the correlation matrix was 
equivalent to the variance of the weighted composite. 
Associated with the largest eigenvalue was a vector of 
weights (w) to be applied to the standard Z-scores to form 
the weighted composite. In order to maximize the variance 
of the weighted composite and obtain a unique solution, the 
constraint that the sum of the squares of the weights equals 
1 was introduced, that is: 

$$w'w = 1 \tag{9}$$
The weighted composite in standard score form was obtained 
by: 

$$C = Z\,w\,\lambda^{-1/2} \tag{10}$$

where C is the weighted composite, 
Z is the matrix of Z-scores, 
w is the vector of weights associated with the 
largest eigenvalue, 
$\lambda$ is the largest eigenvalue. 
It was then necessary for purposes of comparison with other 
score assessment methods, to rescale the weighted composite. 
This was done by multiplying the weighted composite by the 
mean variance of the judges as obtained from the ratings. 
The grand mean of the ratings was then added to the rescaled 
weighted composite. 
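
A minimal sketch of the weighted composite is given below. It assumes that the composite is the first principal component of the judges' Z-scores divided by the square root of the largest eigenvalue, so that it has unit variance. The ratings are invented, the power iteration is a stand-in for whatever routine was actually used, and the rescaling step is left out.

```python
import math
from statistics import mean, stdev

# Hypothetical ratings: one row per athlete, one column per judge.
ratings = [
    [9.1, 9.0, 9.2, 9.1],
    [8.4, 8.6, 8.5, 8.3],
    [7.9, 8.1, 7.8, 8.0],
    [8.8, 8.7, 8.9, 9.0],
    [8.2, 8.0, 8.3, 8.1],
]
n, p = len(ratings), len(ratings[0])

# Standardize each judge's ratings (Z-scores, sample standard deviation).
columns = list(zip(*ratings))
Z = [[(ratings[i][j] - mean(columns[j])) / stdev(columns[j]) for j in range(p)]
     for i in range(n)]

# Correlation matrix R = Z'Z / (n - 1).
R = [[sum(Z[i][a] * Z[i][b] for i in range(n)) / (n - 1) for b in range(p)]
     for a in range(p)]

# Power iteration for the largest eigenvalue and its unit-length weight vector.
w = [1.0 / math.sqrt(p)] * p
for _ in range(500):
    v = [sum(R[a][b] * w[b] for b in range(p)) for a in range(p)]
    norm = math.sqrt(sum(x * x for x in v))
    w = [x / norm for x in v]
lam = sum(w[a] * R[a][b] * w[b] for a in range(p) for b in range(p))

# Weighted composite in standard score form: Z w / sqrt(lambda).
C = [sum(Z[i][j] * w[j] for j in range(p)) / math.sqrt(lam) for i in range(n)]

print(round(lam, 3), [round(c, 3) for c in C])
```

Dividing by the square root of the eigenvalue is what puts the composite in standard score form, since the variance of Zw equals w'Rw.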

The second method of performance score assessment 
consisted of taking the arithmetic mean of the ratings 
received by each subject in order to obtain an unweighted 
mean score. 

The F.I.G. mean score was obtained by averaging the 
middle two scores of the four judges, excluding the superior 
judge's score. This became the third method of score 
assessment. 

In the fourth method, the performance score was 
determined by averaging the ratings given by the two most 
reliable judges. 
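
The three simpler scoring rules can be written out as follows. This is an illustrative sketch only: the function names, the sample ratings and the reliability coefficients used to pick the two most reliable judges are hypothetical.

```python
from statistics import mean


def unweighted_mean_score(ratings):
    """Second method: arithmetic mean of all ratings received by an athlete."""
    return mean(ratings)


def fig_mean_score(four_judge_ratings):
    """Third method: mean of the middle two of the four judges' scores
    (the superior judge's score is excluded before calling this function)."""
    ordered = sorted(four_judge_ratings)
    return mean(ordered[1:3])


def two_most_reliable_score(ratings, reliabilities):
    """Fourth method: mean of the ratings of the two judges with the highest
    individual reliability coefficients."""
    ranked = sorted(range(len(ratings)),
                    key=lambda j: reliabilities[j], reverse=True)
    return mean(ratings[j] for j in ranked[:2])


# Hypothetical ratings of one athlete by four judges, with hypothetical
# individual judge reliability coefficients.
ratings = [9.0, 9.4, 9.1, 9.3]
reliabilities = [0.95, 0.81, 0.92, 0.88]

print(unweighted_mean_score(ratings))
print(fig_mean_score(ratings))
print(two_most_reliable_score(ratings, reliabilities))
```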

A Kendall rank correlation (Siegel, 1956, pp. 213-222) 
was then calculated between the four score assessment methods. 
The results are presented in Table 5. 

As a general observation, very high coefficients of 
correlation were obtained between the different score 
assessment methods. However, low coefficients of correlation were 
obtained in some situations where a small number of athletes 
participated in the event. For example, the coefficients 
of correlation between the scoring methods for the horizontal 
bar event (National Meet, Senior Men's Finals) were 1.000, 
0.333, 0.667, 0.333, 0.667, 0.667, but only four athletes 
participated in that event. 
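
With only four athletes there are six pairs to compare, so the rank correlation can only move in steps of one third, which is why values such as 0.333 and 0.667 appear. The sketch below computes a Kendall coefficient for untied scores; it ignores the correction for ties and uses invented scores.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall rank correlation for two score lists without ties:
    (concordant pairs - discordant pairs) / total number of pairs."""
    s = 0
    pairs = 0
    for i, j in combinations(range(len(x)), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:
            s += 1   # both methods order this pair of athletes the same way
        elif product < 0:
            s -= 1   # the two methods disagree on this pair
        pairs += 1
    return s / pairs


# Hypothetical scores of four athletes under two scoring methods.
method_a = [9.2, 9.0, 8.7, 8.5]
same_order = [9.3, 9.1, 8.8, 8.4]
one_swap = [9.3, 9.1, 8.4, 8.8]
two_swaps = [9.1, 9.3, 8.4, 8.8]

print(round(kendall_tau(method_a, same_order), 3))
print(round(kendall_tau(method_a, one_swap), 3))
print(round(kendall_tau(method_a, two_swaps), 3))
```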

With a large number of subjects the rank correlation 
coefficients tend to be very high. For example, the lowest 
rank correlation coefficient at the National Meet, Junior 
Men's Optionals was 0.839, for the Junior Men's Compulsories, 
0.827, the Senior Men's Optionals, 0.787, and for the Senior 
Men's Compulsories, 0.802. For the Intercollegiate Meet 
the lowest rank correlation was found to be 0.871. 

For each combination of scoring methods the mean of 
the rank correlation coefficients for the six events within 
the same competition was obtained. For seven of the nine 
competitions analysed, it was found that the mean rank 
correlation between the weighted composite score and the 
unweighted mean score was higher than the mean rank correlation 
of any other combination of methods. The exceptions were 
the mean rank correlation between the unweighted mean 
score and the F.I.G. score at the National Trials, and that between 
the weighted composite score and the average score of the 
two most reliable judges for the Junior Men's Finals at the 
National Meet. 

In summary, small differences in the rank correlation 
coefficients were generally observed among the four methods 
of score assessment. Greater differences were found in 
situations where a small number of athletes participated 
in an event. These results, however, do not clearly indicate 
the superiority of one method over another. 

CHAPTER IV 


SUMMARY AND CONCLUSIONS 


Summary 


The foremost purpose of this study was to estimate 
the reliability of gymnastic ratings and the reliability 
of gymnastic raters. It was also a purpose of this study 
to compare different methods of assessing the performance 
score of each athlete. 


The ratings of four competitions were analysed: 

1. Intercollegiate Meet: Edmonton 1971 
2. National Trials: Winnipeg 1973 
3. Canada-China Meet: Montreal 1973 
4. National Meet: Toronto 1973. 


In order to estimate the reliability of the ratings, the 
analysis of variance was used and standards of excellence of 
the ratings were suggested in relation with the average 
reliability of the ratings. It was proposed that a coefficient 
of 0.90 and above be used to denote excellent ratings in 
situations where a large range of ability exists among the 
athletes. When the range of ability of the athletes is smaller, 
a coefficient between 0.80 and 0.89 has been suggested. 


The reliability of each judge was obtained by the 
principal components method of factoring. Equivalent 
standards as above to qualify an excellent rater were suggested. 
Furthermore, it was proposed that the proportion of the total 
variance accounted for by the largest eigenvalue be used to 
assess the quality of the judging of an event. It was also 
suggested that the mean proportion of the total variance 
accounted for by the largest eigenvalue for the six events 
be used as an overall indication of the quality of the 
ratings for a whole competition. 

Finally, four methods of assessing the performance 
score were compared. However, the results obtained did 
not suggest the superiority of one method over another. 


Conclusion 

Results of this study indicate the feasibility of 
assessing the quality of the ratings and the raters in order 
to identify competence and objectivity in gymnastic judging. 

From the coefficients of average reliability of the 
ratings (Table 1), it was observed that there was little 
difference between the unadjusted and the adjusted coefficients. 
This could be accounted for by the similar variability in the 
judges' scores and would also indicate as a general rule that 
one judge does not tend to rate the performance higher or 
lower than the other judges. 

It is assumed from the classical test theory model 
that the average of a larger number of ratings would give a 
better estimate of the true performance score. However, this 
study gave no indication that a new scoring method should be 
suggested to improve on the current F.I.G. system. 

Estimates of raters' reliability could be used to 
establish a profile of ability for individual judges which 
could then be integrated with existing certification and 
promotion programs. Similar profiles could also be drawn 
from the reliability of the ratings. 

After having identified the status of Canadian judges 
in terms of the reliability of their ratings, more research 
seems to be needed to study the factors which may influence 
the variability of judges' scores. Because the judges do not 
view a performance from the same angle and the same position, 
do they really assess the same performance? This aspect of 
the validity of the ratings is certainly worthy of 
consideration. 

It was observed that lower coefficients of reliability 
of the ratings and the raters were obtained for the final 
competition at the National Meet. Since the final was held 
at the end of the third day of competition, it is suggested 
that a fatigue factor on the part of the judges might have 
affected the quality of the ratings. This area of investigation 
could be the object of further research. 

Finally, since human judging is employed in other 
areas of athletic competition such as skiing, skating and 
diving, a comparison of the reliability of their different 
systems of rating could be valuable. 